Objectives

With  funding  from  this  project,  we  addressed  the  following  research  problems: 

•  The  application  of  database  principles  to  large-scale  behavioral  simulations.  As  the 
military  expands  its  mission  into  new  areas  such  as  domestic  disaster  recovery,  large- 
scale  simulations  of  individual  behavior  are  important  tools  for  analyzing  how  best  to 
address  these  situations.  However,  behavioral  simulations  are  often  complex,  and  do  not 
scale  enough  to  cover  the  population  of  a  major  urban  area.  How  can  we  leverage  the 
technology  from  massively  scalable  database  systems  to  produce  efficient  large-scale 
behavioral  simulations? 

•  A  language  for  expressing  efficient  large-scale  behavioral  simulations.  Defining  a 
large-scale  simulation  in  a  database  query  language  like  SQL  is  prohibitively  complex; 
there  is  no  way  to  isolate  or  encapsulate  individual  behavior.  What  language  features  do 
we  need  to  translate  behavioral  simulations  of  individuals  into  database  query  plans? 

•  New  algorithms  for  processing  large-scale  simulations.  The  types  of  database  query 
plans  that  correspond  to  large-scale  simulations  are  likely  very  different  from  traditional 
database  query  plans.  What  new  algorithms  can  we  develop  specifically  to  optimize 
these  types  of  query  plans? 


Executive  Summary 

With funding from this grant, we developed a novel technique that converts simulations
into database query plans and optimizes their performance with database processing
techniques. As part of this research we made the following five contributions:

1. We devised a novel design pattern for behavioral simulations; any simulation
developed using this pattern can be converted into a database query plan (either
manually or automatically).

2.  We  developed  a  novel  language  with  formal  semantics  that  supports  this  software 
design  pattern. 

3.  We  developed  an  optimizing  compiler  that  automatically  converts  simulations  written 
in  our  programming  language  into  database  query  plans. 

4.  We  developed  novel  algorithms  for  optimizing  the  query  plans  produced  by  our 
compiler. 


5.  We  developed  novel  techniques  for  integrating  our  technology  into  a  distributed 
simulation  platform. 
Major Discoveries


1. A novel programming pattern that allows behavioral simulations to be expressed as
database query plans; from this pattern, we designed a novel programming language and
an optimizing compiler.

2. Optimization algorithms for simulation query plans that provide asymptotically better
performance over traditional techniques; this improvement is several orders of magnitude
in many of our experiments.

3. Concurrency-control algorithms for processing distributed simulations.


Honors/Awards

Johannes Gehrke received a Faculty Development Award from the New York State Foundation
for Science, Technology, and Innovation.


Major  Technical  Contributions 


1.  A  Design  Pattern  for  Efficient  Simulated  Behavior 

A serious problem with large-scale behavioral simulations is their computational expense; as
each individual may have to examine or interact with every other individual in order to
determine its next action, the cost can grow quadratically with the number of individuals.
However, much of this computation is redundant; individuals with similar properties see the
world in the same way and come to the same conclusions.

This  provides  us  with  our  motivation  for  transforming  a  simulation  into  a  database  query  plan. 
Query plans process data set-at-a-time. That is, at every point of the computation, they read in
all the data at once and perform related operations on it. This makes it easier to identify and
eliminate redundant computation, and is a large part of what makes database management
systems  scalable. 

However,  database  query  languages  are  declarative  languages.  They  do  not  have  variables  or 
assignments  like  we  would  find  in  an  imperative  language  like  C/C++  or  Java.  This  is  an  issue 
for  programming  behavioral  simulations.  We  would  like  to  program  each  individual  separately, 
and  allow  it  to  interact  with  others  via  the  exchange  of  information.  Exchanging  information 
between  programs  is  difficult  without  either  variables  or  a  message-passing  framework,  neither 
of  which  is  present  in  declarative  languages. 

To  address  this  problem,  we  have  designed  a  new  programming  design  pattern  for  behavioral 
simulations,  which  we  call  the  state-effect  pattern.  This  pattern  allows  individuals  to  exchange 
information via variable assignments; however, programs written in this pattern can always be
rewritten to eliminate these variables. This makes it possible to rewrite the simulation in a
database language like SQL.

In the state-effect pattern, we assume that individual behaviors are processed together in discrete
time steps. These time steps might correspond to animation frames or to some simulated unit of
time (e.g. 1 second). Processing during each time step is separated into a query phase and an
update phase. Intuitively, the query phase reads the state of an individual from a previous time
step; it cannot make any changes to an individual. On the other hand, the update phase changes
the state of an individual, but cannot access other individuals while doing so. The advantage of
these read-only and write-only phases is that they can be rewritten to use no variables at all; this
makes the simulation more suitable for conversion into a database-like language.

To support these two phases, the attributes of each individual are classified as either states or
effects. These classifications obey the following rules:

•  State  attributes  are  read-only  in  the  query  phase. 

•  Effect  attributes  are  write-only  in  the  query  phase. 

•  Multiple writes to an effect attribute are aggregated via an aggregate function like sum or
average.

•  Effect  attributes  are  read-only  in  the  update  phase. 

•  In  the  update  phase,  each  individual  updates  its  state  attributes  from  its  effects  and  old 
state  values. 

The key feature of this pattern is that effects are aggregated. An aggregation function takes in a
set of data and combines it into a single value; it does this independently of the order in which the
data is received. This is an extremely powerful feature, as it allows us to dramatically rearrange
the processing of individuals during the simulation. We can process individuals in any order, or
even in parallel. We can split up an individual, processing it for a little while, switching to a new
individual, and then returning to the original individual. All that matters is that all of the data for
each effect attribute arrives before the end of the query phase.
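The order-independence argument can be checked with a small Python sketch. It is an illustration only, not part of the project's code: the helper names (aggregate_sum, aggregate_in_chunks) and the sample writes are assumptions made for this example. It aggregates the same set of writes to one effect attribute in the original order, in a shuffled order, and in partial chunks that are merged afterwards, as a parallel execution might do.

```python
import random


def aggregate_sum(writes):
    """Combine all writes to one effect attribute into a single value."""
    return sum(writes)


def aggregate_in_chunks(writes, chunk_size):
    """Aggregate partial chunks first, then merge the partial results.

    This mimics processing groups of individuals separately (or in
    parallel) and combining their contributions at the end of the
    query phase.
    """
    partials = [sum(writes[i:i + chunk_size])
                for i in range(0, len(writes), chunk_size)]
    return sum(partials)


# Integer writes keep the comparison exact; floating-point sums may
# differ in the last bits when the order of additions changes.
writes = [3, 1, 4, 1, 5, 9, 2, 6]
shuffled = writes[:]
random.Random(0).shuffle(shuffled)

in_order = aggregate_sum(writes)
any_order = aggregate_sum(shuffled)
chunked = aggregate_in_chunks(shuffled, chunk_size=3)
```

All three results agree because sum is commutative and associative; an aggregate without those properties would not give the same freedom to reorder.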

To  best  apply  this  programming  pattern,  we  need  a  custom  programming  language  that  enforces 
the  pattern  and  transforms  the  program  into  a  database  query  plan.  However,  we  believe  that  the 
identification  of  this  pattern  is  potentially  useful  for  existing  programming  languages.  While 
these  languages  may  have  features  (such  as  variable  assignments)  that  prevent  their  programs 
from  being  converted  to  a  query  plan,  compiler  extensions  may  be  able  to  optimize  the  parts  of 
their  programs  that  conform  to  the  state-effect  pattern. 


2.  A  Programming  Language  Supporting  our  Pattern  for  Efficient  Simulated 
Behavior 


In order to convert the simulation to a database query plan, we need to ensure that it conforms
to the state-effect pattern. For this reason, we have developed a programming language for
behavioral simulations that makes the state-effect pattern explicit. We call this language
BRASIL (Big Red Agent Simulation Language). BRASIL is an object-oriented language that
looks superficially like Java. Each behavior that we want to program is its own class; all
individuals that implement this behavior are instances of this class.

Below  is  an  example  of  a  BRASIL  script.  This  script  represents  a  person  performing  a  random 
walk  through  a  crowd  in  two-dimensional  space.  The  person  uses  an  imaginary  force  to  repel 
others  nearby  in  order  to  push  her  way  through  the  crowd. 

class Person {

    // Location of person in 2D space
    public state float x : (x + vx);
    public state float y : (y + vy);

    // The latest velocity for the person
    public state float vx : vx + rand() + avoidx / count * vx;
    public state float vy : vy + rand() + avoidy / count * vy;

    // Used to update our velocity
    private effect float avoidx : sum;
    private effect float avoidy : sum;
    private effect int count : sum;

    /** The query-phase for this person. */
    public void run() {

        // Use "forces" to repel others that are too close
        foreach (Person p : Extent<Person>) {
            p.avoidx <- 1 / abs(x - p.x);
            p.avoidy <- 1 / abs(y - p.y);
            p.count <- 1;
        }
    }
}

There  are  a  few  important  features  to  note  about  this  script.  First,  all  the  fields  are  clearly 
denoted  as  either  state  or  effect.  Each  state  field  has  an  update  rule;  this  is  an  expression  that 
takes  effects  and  old  state  values  and  uses  them  to  produce  the  new  state  value.  In  contrast,  each 
effect  field  has  an  aggregation  function  associated  with  it.  This  function  is  used  to  combine  all 
the  values  assigned  to  the  same  effect  field.  In  this  respect,  effect  fields  are  similar  to  aggregator 
variables  in  Google’s  Sawzall  language;  for  this  reason  we  use  the  Sawzall  operator  <-  for 
assignments  to  effect  fields. 

Another thing to notice in this script is the run() method. This method defines the query phase
for this individual. Intuitively, our simulation proceeds as follows:

•  We  process  the  run()  method  for  each  individual.  This  results  in  a  collection  of  effect 
assignments. 


•  For each individual, we aggregate all of the effect assignments made to it in the query
phase.

•  For each state field of each individual, we apply the update rule to get the new value.
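The three steps above can be sketched in Python for the Person script. This is a hand-written illustration, not output of the BRASIL compiler. The names Person, tick, and noise are hypothetical, noise stands in for rand() so that the example is deterministic, and two details are assumptions added here because the script does not specify them: pairs that would divide by zero (the individual itself, or equal coordinates) are skipped, and an individual that receives no effect assignments gets no push.

```python
from dataclasses import dataclass


@dataclass
class Person:
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0


def tick(people, noise=lambda: 0.0):
    """Advance all individuals by one time step."""
    # Step 1: query phase. Read old state only and collect effect
    # assignments (avoidx, avoidy, count) for each receiver.
    writes = [[] for _ in people]
    for me in people:
        for j, p in enumerate(people):
            if p is me or p.x == me.x or p.y == me.y:
                continue  # assumption: skip pairs that divide by zero
            writes[j].append((1 / abs(me.x - p.x), 1 / abs(me.y - p.y), 1))

    # Step 2: aggregate the assignments made to each individual (sum).
    effects = [tuple(sum(col) for col in zip(*w)) if w else (0.0, 0.0, 0)
               for w in writes]

    # Step 3: update phase. Each new state value is computed from the
    # individual's own effects and old state values only.
    for p, (avoidx, avoidy, count) in zip(people, effects):
        push_x = avoidx / count * p.vx if count else 0.0
        push_y = avoidy / count * p.vy if count else 0.0
        p.x, p.y = p.x + p.vx, p.y + p.vy  # uses the old velocity
        p.vx, p.vy = p.vx + noise() + push_x, p.vy + noise() + push_y


a = Person(0.0, 0.0, vx=1.0, vy=0.0)
b = Person(2.0, 4.0, vx=0.0, vy=1.0)
tick([a, b])
```

Because step 1 never modifies state and step 3 never reads another individual, the loops in each step could run in any order without changing the result.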


In order to convert our scripts into database query plans, we need to place two further restrictions
on our language. First of all, local values must also conform to the state-effect pattern, so that
they can be eliminated as well. Local values are either constants, which cannot be changed after
they are initialized, or effects, which have an associated aggregator. The following script is
equivalent to our example above, but uses local values for the main computation in the query
phase:

class Person {

    // Location of person in 2D space
    public state float x : (x + vx);
    public state float y : (y + vy);

    // The latest velocity for the person
    public state float vx : nvx;
    public state float vy : nvy;

    // New velocity
    public effect float nvx : sum;
    public effect float nvy : sum;

    /** The query-phase for this person. */
    public void run() {

        // Local variables
        effect avoidx : sum;
        effect avoidy : sum;
        effect count : sum;

        // Use "forces" to repel others that are too close
        foreach (Person p : Extent<Person>) {
            avoidx <- 1 / abs(x - p.x);
            avoidy <- 1 / abs(y - p.y);
            count <- 1;
        }

        nvx <- vx + rand() + avoidx / count * vx;
        nvy <- vy + rand() + avoidy / count * vy;
    }
}

In addition to this restriction on local variables, we prohibit for-loops or while-loops. The
only supported form of iteration is a foreach-loop, which iterates over the elements of a set; in
the above example, this set is Extent<Person>, which is the set of all instances of the class
Person. These loops function as a localized version of the state-effect pattern. Effect variables
declared outside of a foreach-loop are write-only within the loop. However, effect variables
can always be read outside of the foreach-loop, as that can be considered their update phase.
In the above example, avoidx is write-only in the foreach-loop, but can be read for the
computation of nvx outside of the loop.

While these limitations may seem restrictive at first, we have discovered that it is possible to
design a large class of behavioral simulations within this language. Everything follows from the
state-effect pattern; once that pattern is mastered, programming in BRASIL is relatively
straightforward. When we combine this with the ability to compile BRASIL into database query
plans, it becomes a very powerful framework for programming behavioral simulations.

Finally, we have integrated some features to support transactions for multiple effect fields. The
state-effect pattern ensures that all individuals interact with one another simultaneously, and
effects are aggregated together. When effects may be in conflict with one another, the
aggregation method may simply pick a single effect and disregard the others. In essence, this is
equivalent to aborting an assignment to an effect field. However, sometimes our simulation
programs may want to couple two effects together. For example, in the first Person script
above, we may want to ensure that if an assignment to the avoidx effect field is aborted, then
so is an assignment to the avoidy field. This coupling of assignments is very similar to that of
transactions in databases, where an entire compound action is either committed or aborted.


3.  An  Optimizing  Compiler  That  Converts  BRASIL  Programs  into  Database 
Query  Plans 

Because BRASIL strictly enforces the state-effect pattern, it is possible to compile any BRASIL
script into a database query plan. This plan can be represented in either SQL or some other
query plan representation format. In our work on BRASIL, we chose the relational algebra as
our representation format, as it is common to all database query languages and is one of the
cleanest representations.

Our compiler performs a source-to-source translation between BRASIL syntax and relational
algebra expressions. The language features of BRASIL have a direct correspondence to database
operations. Conditionals correspond to selecting data from a database table, and foreach-loops are
equivalent to joining two tables together. Assignments to an effect variable translate into
aggregating data in a table.
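To make this correspondence concrete, the following Python sketch hand-translates the query phase of the first Person script into a single SQL statement and runs it with the standard sqlite3 module. It is not output of the BRASIL compiler, which targets relational algebra; the table layout, the column names, and the predicate that excludes an individual from its own join partners are assumptions made for this example.

```python
import sqlite3

# Hypothetical sample data: (id, x, y) for three individuals.
people = [(1, 0.0, 0.0), (2, 2.0, 4.0), (3, 4.0, 8.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", people)

# foreach-loop       -> self-join of person with itself
# conditional        -> predicate selecting the qualifying pairs
# effect assignments -> SUM/COUNT grouped by the receiving individual
rows = conn.execute(
    """
    SELECT p.id,
           SUM(1.0 / ABS(me.x - p.x)) AS avoidx,
           SUM(1.0 / ABS(me.y - p.y)) AS avoidy,
           COUNT(*)                   AS cnt
    FROM person AS me
    JOIN person AS p ON me.id <> p.id
    GROUP BY p.id
    ORDER BY p.id
    """
).fetchall()
conn.close()
```

Each row of the result holds the aggregated effect values for one individual, which is the input that the update phase needs.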

One of the major advantages of converting BRASIL programs into database query plans is that
we can use “rewrite rules” from the database literature to transform and reorder suboptimal code.
For example, the program code

effect counts : sum;
effect countd : sum;

foreach (Person p : Extent<Person>) {
    foreach (Person q : Extent<Person>) {
        if (p == q) { counts <- 1; }
        else { countd <- 1; }
    }
}

is  equivalent  to  the  program  code 

effect counts : sum;
effect countd : sum;

foreach (Person p : Extent<Person>) {
    counts <- 1;
}

// n persons give n pairs with p == q and n * (n - 1) pairs with p != q
countd <- counts * (counts - 1);


The first program has two nested loops and is much less efficient. When we convert these
programs into the relational algebra, a database query processor will recognize that the joins in
the first program are redundant and eliminate them. The resulting query plan after this join
elimination will be the same as if we translated the second program directly.
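The equivalence behind this rewrite can be checked with a short Python sketch; the function names are illustrative and not part of BRASIL. With n individuals the nested loops visit n * n ordered pairs, of which n have p == q and n * (n - 1) have p != q, so a single pass that counts individuals is enough to derive both values.

```python
def counts_nested(n):
    """Quadratic version: compare every ordered pair (p, q)."""
    counts = countd = 0
    for p in range(n):
        for q in range(n):
            if p == q:
                counts += 1
            else:
                countd += 1
    return counts, countd


def counts_rewritten(n):
    """Linear version: one pass, then a closed form for the unequal pairs."""
    counts = 0
    for _ in range(n):
        counts += 1
    countd = counts * (counts - 1)
    return counts, countd


# The two versions agree for every population size tried here.
agree = all(counts_nested(n) == counts_rewritten(n) for n in range(50))
```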

There are many other rewrite rules that we can apply to the query plans generated by BRASIL
scripts. Another example is using database duplicate elimination to remove redundant
computation. In simple experiments where the individuals have to perform expensive scientific
computation (but not much else), duplicate elimination has resulted in performance
improvements of an order of magnitude.
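The idea behind duplicate elimination can be sketched in Python as follows. This is an illustration under assumed names (expensive, simulate_naive, simulate_deduplicated), not the project's implementation: when several individuals feed identical inputs to the same deterministic computation, the computation runs once per distinct input and the result is shared.

```python
calls = 0


def expensive(value):
    """Stand-in for a costly, deterministic per-individual computation."""
    global calls
    calls += 1
    return value * value


def simulate_naive(inputs):
    """One evaluation per individual."""
    return [expensive(v) for v in inputs]


def simulate_deduplicated(inputs):
    """One evaluation per distinct input, shared by all individuals."""
    results = {v: expensive(v) for v in set(inputs)}
    return [results[v] for v in inputs]


inputs = [2, 3, 2, 2, 5, 3, 5, 2]  # eight individuals, three distinct inputs

calls = 0
naive = simulate_naive(inputs)
naive_calls = calls

calls = 0
deduplicated = simulate_deduplicated(inputs)
deduplicated_calls = calls
```

The saving depends entirely on how many individuals share inputs; with all inputs distinct, the deduplicated version does the same amount of work as the naive one.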


4.  Optimization  Algorithms  for  Database  Query  Plans  Representing 
Simulated  Behavior 

While  rewrite  rules  allow  us  to  perform  some  optimization  on  our  behavioral  simulations,  they 
are  primarily  useful  in  removing  unnecessary  code.  By  themselves,  they  do  not  help  with  the 
quadratic  behavior  that  results  when  each  individual  interacts  with  every  other  individual.  To  do 
that,  we  need  to  leverage  other  database  techniques,  such  as  automatic  index  generation. 

By examining the query plans generated by the BRASIL compiler, we have seen that the vast
majority of plans contain spatial joins, and that this is the source of the quadratic behavior. In a
spatial join, each individual is paired with all of the other individuals nearby. There are many
indices in the literature for improving database performance of spatial joins. However, these
indices only improve performance if, for each individual, they can filter out a large number of
individuals that do not interact with it. If the individuals in the simulation are close together,
then none are filtered, and so there is no performance improvement.

However,  we  have  discovered  that  we  can  improve  performance  even  in  the  case  where  all 
individuals  interact  with  one  another,  through  the  use  of  aggregate  indices.  Unlike  in  standard 
database  query  plans,  the  queries  produced  by  BRASIL  always  follow  a  spatial  join  with  an 
aggregate  computation  that  collapses  each  set  of  pairings  into  a  single  value.  Instead  of  filtering 
out  extra  individuals,  these  indices  precompute  the  aggregate  computation  over  sets  of 
individuals  so  that  we  can  share  computation  between  multiple  individuals. 

Aggregate indexing is not new, but we can use it in novel ways by exploiting the guarantees
provided by the state-effect pattern. In particular, the existence of separate query and update
phases ensures there will be enough queries between index updates to amortize the cost of an
asymptotically efficient bulk-build operation at every tick. Without this guarantee, we would
need to use more conventional dynamic structures, supporting insertions and deletions, at a
factor of log n penalty in asymptotic cost. In practice, this has resulted in an order of magnitude
improvement on simulations with only 10,000 individuals.
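A much simplified, one-dimensional Python sketch illustrates the idea of a bulk-built aggregate index. The report does not describe the actual index structures at this level of detail, so the prefix-sum layout and all names below (build_index, range_sum, naive_range_sum) are assumptions chosen for illustration. The index is rebuilt from scratch once per tick, after the update phase, and then answers every range-sum query of the query phase with two binary searches instead of a scan over all individuals.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def build_index(positions, weights):
    """Bulk-build: sort by position once and precompute prefix sums."""
    order = sorted(range(len(positions)), key=positions.__getitem__)
    xs = [positions[i] for i in order]
    prefix = [0] + list(accumulate(weights[i] for i in order))
    return xs, prefix


def range_sum(index, lo, hi):
    """Sum of the weights of all individuals with lo <= position <= hi."""
    xs, prefix = index
    return prefix[bisect_right(xs, hi)] - prefix[bisect_left(xs, lo)]


def naive_range_sum(positions, weights, lo, hi):
    """Reference version: scan every individual for every query."""
    return sum(w for x, w in zip(positions, weights) if lo <= x <= hi)


positions = [5, 1, 9, 3, 7, 3]
weights = [2, 4, 1, 3, 5, 6]
radius = 2

index = build_index(positions, weights)
indexed = [range_sum(index, x - radius, x + radius) for x in positions]
naive = [naive_range_sum(positions, weights, x - radius, x + radius)
         for x in positions]
```

The index never supports insertions or deletions; the separation into query and update phases is what makes a static structure that is rebuilt once per tick sufficient.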


Other optimizations are enabled by the existence of query and update phases. For example, we
have been able to introduce sweep-line algorithms from computational geometry into spatial
indexing techniques to further improve the performance of our simulations.

5.  Distribution  of  Database  Query  Plans  Representing  Simulated  Behavior 

Most of our work on this project has involved the use of database techniques to optimize
behavioral simulations on a single machine. However, we have begun some initial investigations
in using database techniques to distribute behavioral simulations across multiple machines. The
primary issues are communication and synchronization: At the end of each time step, every
individual has to receive the effect computations from all other individuals before proceeding.
The necessary network communication can impede the performance of the simulation.

There are many techniques in the literature to reduce network overhead by limiting
communication to individuals that are “visible” to one another. However, simple visibility is
often not sufficient. For example, a simulation may have an “observer” object that gathers
statistics from all other individuals in the database.

Another  issue  with  visibility-restricted  algorithms  is  that  they  are  unable  to  deal  with  the 
transitivity  of  actions;  individuals  can  easily  interact  with  one  another  even  if  they  cannot  see 
one  another.  Suppose  we  have  three  individuals  A,  B,  and  C  where  A  and  B  are  visible  to  each 
other,  and  B  and  C  are  visible  to  each  other,  but  A  and  C  cannot  see  one  another.  A  visibility- 
restricted  algorithm  will  not  send  any  coordinated  messages  between  A  and  C.  As  we  have 
shown,  this  can  lead  to  an  inconsistent  state  in  a  distributed  simulation. 

Fortunately,  when  we  translate  simulations  into  database  queries,  they  have  a  lot  of  semantic 
information  that  can  be  leveraged  to  limit  communication.  Simulations  are  essentially  high- 
dimensional  databases  where  the  attributes  can  change  only  in  predictable  ways.  By  treating 
each  time  step  as  a  database  transaction,  we  were  able  to  take  ideas  from  distributed  databases  to 
limit communication overhead for a wide class of behavioral simulations. In particular, while
we were able to handle more complex behavioral interactions than the visibility-restricted
algorithms could, our algorithms added only 1% overhead to the state-of-the-art visibility-
restricted algorithms.

We also examined the issue of fault tolerance in distributed simulations. All of the individuals in
a simulation are interdependent: If a machine in a distributed simulation goes down, the
simulation cannot proceed until it is restored. As part of this research we performed a thorough
experimental evaluation of various checkpoint-recovery algorithms in the literature. Our
analysis considered the overhead of supporting these algorithms and the expected recovery time.
For our simulations, there was no clear winner. For simulations with very high update rates, a
naive snapshot scheme performs best, while for computationally intensive simulations with
lower update rates the best option is a more sophisticated copy-on-update scheme.
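The trade-off between the two schemes can be illustrated with a simplified Python sketch. It is not one of the algorithms that were evaluated: the class names are hypothetical, the state is a plain dictionary with a fixed set of keys and immutable values, and writing the checkpoint image to stable storage is left out. The naive scheme copies the whole state when a checkpoint begins; the copy-on-update scheme copies nothing up front and saves an old value only the first time its key is updated during the checkpoint.

```python
class NaiveSnapshot:
    """Copy the entire state eagerly when a checkpoint begins."""

    def __init__(self, state):
        self.state = state
        self.image = {}
        self.copied = 0  # number of values copied so far

    def begin_checkpoint(self):
        self.image = dict(self.state)
        self.copied += len(self.state)

    def update(self, key, value):
        self.state[key] = value

    def checkpoint_image(self):
        return dict(self.image)


class CopyOnUpdate:
    """Save an old value only on the first update after a checkpoint begins."""

    def __init__(self, state):
        self.state = state
        self.saved = {}
        self.copied = 0  # number of values copied so far

    def begin_checkpoint(self):
        self.saved = {}

    def update(self, key, value):
        if key not in self.saved:
            self.saved[key] = self.state[key]
            self.copied += 1
        self.state[key] = value

    def checkpoint_image(self):
        # Saved old values take precedence over the live state.
        return {k: self.saved.get(k, v) for k, v in self.state.items()}


initial = {"a": 1, "b": 2, "c": 3, "d": 4}
naive = NaiveSnapshot(dict(initial))
cou = CopyOnUpdate(dict(initial))
for scheme in (naive, cou):
    scheme.begin_checkpoint()
    scheme.update("a", 10)
    scheme.update("a", 11)
    scheme.update("c", 30)
```

Both schemes reproduce the state as it was when the checkpoint began, but the amount of copying differs: the naive scheme pays for every value once, while copy-on-update pays per distinct updated key plus a membership test on every update, which is consistent with it being less attractive when most of the state changes in each tick.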


